ABSTRACT

The use of "bias elimination procedures" to reduce
the racial bias of test items is discussed. These procedures were
put forward by G. R. Anrig (1988) and R. L. Linn and F. Drasgow (1987).
Anrig stated that subjects who "knew the same amount about a test
item" should have a similar chance of answering it correctly
"regardless of their race, sex, or ethnic background." Linn and
Drasgow stated that an adequate approach to detecting item bias
"requires a means of distinguishing between differences that are due
to group differences in the developed skills of the test takers and
those that are due to extraneous factors." The latter researchers
propose a one-dimensional item response theory (IRT) criterion.
However, this procedure provides no agreed external criterion for
making a judgment concerning bias. A requirement to select those
items that minimize group differences on the final test does appear
to meet a general requirement for equity; this is the procedure
put forward by the Golden Rule Insurance Company in its debate with the
Educational Testing Service. It is concluded that: (1) there may be
no purely technical solution to the problems of test bias; (2) the
test construction process should recognize the need to make
ideological and social choices; and (3) IRT will not provide a
solution. (TJH)

1 INTRODUCTION

Recent articles in the journal 'Educational Measurement: Issues and Practice' (in Summer 1987, 
and Spring 1988) have explored issues arising from the debate between the Golden Rule
Insurance Company (GRIC) and Educational Testing Service (ETS). The debate has centered
around the procedures originally agreed between GRIC and ETS for minimising Black-White
test score differences. These procedures were based upon choosing those test items which showed
the smallest group differences after various standard item screening techniques had been
employed to yield candidate items for inclusion in the test.

Among the important political and social issues which this debate has highlighted is that of the
relationship between the technical characteristics of a test and its social impact. This relationship,
however, is only partly explored in the above articles, and the purpose of the present paper is to
extend the debate by questioning whether the 'bias elimination' procedures discussed by Linn
and Drasgow (1987) and Anrig (1988) really address the point at issue. First, however, some
historical perspective may be useful.



2 HISTORICAL PRECEDENCE 

It seems clear that one of the major motivations for introducing large scale testing, for
example in the U.K. during the 19th century and the U.S.A. in the early 20th century, was a
concern with equity and an attempt to select on merit. At the same time, as Gould (1981)
demonstrates, cultural expectations played an important role in the way in which the tests
functioned, and in particular the patterns of differences between groups of the population. As
Weiss (1987) points out, it is possible to construct tests, by selecting appropriate item contexts,
to reduce or reverse Black-White differences. Goldstein (1986) makes a similar point in relation
to gender differences. Thus, in part at least, observed group differences may have as much
to do with the cultural expectations of test designers as with any 'real' differences in [illegible]
or ability.

One of the ways in which new tests are validated is to compare the performance of their new
items against items from an existing, often well established, test. It is not too difficult to see
how such a procedure can become an effective vehicle for the perpetuation of old cultural
expectations about how items should discriminate between groups. This
effect has probably been mitigated by the attention to item content and a sensitivity to ethnic
and sexual stereotyping. Nevertheless, ethnic and gender response differences remain (this is
what the Golden Rule dispute is all about) and a key question is whether such differences are
'artifacts' of the items or in some sense 'real'. The next section looks at this issue in detail.



3 THE REALITY OF GROUP DIFFERENCES 

In the article by Linn and Drasgow, and that by Anrig, there is an implicit assumption that a
measure is available for the true 'skill' or 'ability' a test item is supposed to measure. Anrig,
for example, says that subjects who 'know the same amount about a test item' should have a
similar chance of answering it correctly 'regardless of their race, sex, or ethnic background'.
Linn and Drasgow state that an adequate approach to detecting item bias 'requires a means of
distinguishing between differences that are due to group differences in the developed skills of
the test takers and those that are due to extraneous factors'. The problem, of course, is to measure
in some suitable way these 'skill' or 'knowledge' factors. In practice, test constructors use the
total set of items available (or some equivalent test) to provide such measures, and then to
identify unusual or 'outlying' items as candidates for possible bias.

Linn and Drasgow propose a 1-dimensional item response model criterion whereby an item is 
judged to be unbiassed if its characteristic or response curve is the same in each group except 
for a shift in location along the 'ability' scale. Unfortunately, any difference along the ability 
scale can be interpreted either as a 'real' difference between groups, or as a between-group
'bias'. Since the ability scale is estimated effectively as a weighted average function of the item 
responses in each population, such a procedure simply detects 'outlying' items. Thus, for 
example, if all the test items show a similar group difference, it is a matter of judgement whether 
we wish to interpret this as a biassed test or a real group difference or some mixture of these 
two. The point is that there is no agreed external criterion for making a judgement, and no 
amount of statistical modelling can avoid that fact. One of the problems with Item Response 
Theory (for 'theory' read 'models') is that its mathematical complexity masks its logical 
inadequacy. 
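
The indeterminacy described above can be illustrated with a minimal sketch in Python. It assumes a one-parameter logistic item response function and an outlier rule that compares each item's between-group difficulty gap with the average gap. The function names, the tolerance of 0.3 and all difficulty values are hypothetical choices made for illustration and are not taken from Linn and Drasgow.

```python
import math


def icc(theta, b):
    """One-parameter logistic item response curve: P(correct | ability theta, difficulty b)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def flag_outlying_items(b_reference, b_focal, tolerance=0.3):
    """Return indices of items whose group difficulty gap departs from the average gap.

    The ability scale is defined by the items themselves, so only departures
    from the common shift can be detected, never the common shift itself.
    """
    gaps = [bf - br for br, bf in zip(b_reference, b_focal)]
    common_shift = sum(gaps) / len(gaps)
    return [i for i, gap in enumerate(gaps) if abs(gap - common_shift) > tolerance]


# Hypothetical within-group difficulty estimates for five items.
b_reference = [-1.0, -0.5, 0.0, 0.5, 1.0]
b_focal = [-1.0, -0.5, 0.0, 1.3, 1.0]  # only item 3 behaves differently

flags_before = flag_outlying_items(b_reference, b_focal)

# Make every item 0.8 units harder for the focal group (assumed value).
b_focal_shifted = [b + 0.8 for b in b_focal]
flags_after = flag_outlying_items(b_reference, b_focal_shifted)

# The same response probability arises from "lower ability" or "harder item".
p_lower_ability = icc(0.0 - 0.8, 0.5)
p_harder_item = icc(0.0, 0.5 + 0.8)
```

In this sketch the same single item is flagged before and after the constant is added to every focal-group difficulty, and the constant itself can be read equally as lower ability or as uniformly harder items. Nothing inside the model distinguishes the two readings.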



4 THE LEGITIMACY OF GOLDEN RULE PROCEDURES 

If it is accepted that 'technical' approaches to detection of item bias are inadequate because they 
are essentially tautological, we are left with the best efforts of test constructors and the elimination
of content bias while attempting to maintain educational or psychological relevance. Some 
writers on this topic (see for example, Humphreys, 1986), base their recommendations on an 
assumption about the existence of a valid outcome criterion against which group test differences 
can be judged. Needless to say, this begs the question of how such a criterion is to be found. If 
bias exists then it will have affected all the criteria which could be judged relevant. Such an 
approach hardly seems helpful. 

At the end of the day, as the Golden Rule controversy highlights, the test constructor is left with 
candidate test items which are technically similar and also adequate, for example by having 
reasonable within-group discriminations. Nevertheless, some will show smaller and perhaps 
reversed group differences than others. Naturally, as several writers have pointed out, a 
mechanical application of a formula, such as that agreed between ETS and GRIC, can lead to 
bad judgements. Nevertheless, a requirement to select those items which minimize group
differences on the final test does appear to meet a general requirement for equity. Bond (1987)
takes a similar view. Indeed, given the historical circumstances of test construction, we might 
also require test constructors to search and develop systematically items which showed opposite 
effects to those expected. 

Interestingly, Anrig's (1988) reply to Rooney (1987) does not question such a procedure in
general terms. Indeed, it is difficult to see that ETS or any other test agency could do so since,
by definition, the items which are available for consideration, after standard vetting procedures 
have been carried out, are to all intents and purposes technically equivalent; except that they 
exhibit group differences. 
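
The general principle of choosing, from technically equivalent candidates, those items with the smallest group differences might be sketched as follows. This is an illustration only and is not the formula agreed between ETS and GRIC: the `Item` type, the function names and the proportions correct are all hypothetical, and ranking by absolute per-item difference is just one simple way of reducing the difference on the final test.

```python
from typing import NamedTuple


class Item(NamedTuple):
    name: str
    p_group_a: float  # proportion answering correctly in group A
    p_group_b: float  # proportion answering correctly in group B


def select_min_difference(candidates, n_items):
    """Keep the n_items candidates with the smallest absolute group difference."""
    ranked = sorted(candidates, key=lambda item: abs(item.p_group_a - item.p_group_b))
    return ranked[:n_items]


def mean_difference(items):
    """Average signed difference in proportion correct (group A minus group B)."""
    return sum(item.p_group_a - item.p_group_b for item in items) / len(items)


# Hypothetical candidates that have already passed standard screening.
candidates = [
    Item("q1", 0.70, 0.55),
    Item("q2", 0.62, 0.60),
    Item("q3", 0.58, 0.61),  # a 'reversed' item
    Item("q4", 0.75, 0.50),
    Item("q5", 0.66, 0.59),
]

selected = select_min_difference(candidates, 3)
selected_names = [item.name for item in selected]
```

With these made-up figures the three retained items are q2, q3 and q5, and the average group difference on the shortened test is smaller than on the full candidate pool. The judgement about which candidates are educationally relevant in the first place remains outside the calculation.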

5 CONCLUSIONS 

Having argued in favour of the general principle of a procedure like that of the Golden Rule, I 
am left with two outstanding questions. 

The first concerns how we might operate such a procedure, both politically and technically. I 
do not propose to discuss this in detail here, save to point out that if it is to be taken seriously, 
it is not a matter which should be left to the testing agencies. It is a matter of social concern 
which goes beyond those agencies and the testing profession too: at the least it should involve 
educational professionals drawn from diverse backgrounds, including those whose major 
concern is other than testing. 

The second question concerns the attitude of much of the testing profession itself. The reaction
to Golden Rule, at least as expressed in the articles referred to, largely has been to retire into 
technicalities. Thus Linn and Drasgow claim that 'the most widely accepted psychometric 
approach to this problem (of item bias) is based on Item Response Theory'. Widely accepted 
by whom one may ask? Likewise, Anrig refers to 'the theoretical and analytical sophistication 
of the ETS methodology'. It might not be inappropriate to recall an older use of the term
'sophistication', namely 'the process of investing with specious fallacies or of misleading by means
of these' (Oxford English Dictionary). 

It seems that the testing profession needs to develop a greater awareness of the social and political
implications of its techniques. It also needs to display a greater willingness to expose the logical
structures of its techniques, rather than reverting to mathematicisation, when challenged. If it
cannot undertake such tasks itself, then it should not be too surprised if others seek to do this 
for it. 



6 SUMMARY 

The dispute between ETS and the Golden Rule Insurance Company raises important social 
issues. It is argued that the testing profession needs to recognise that there may be no purely
technical solutions to problems of test bias and that the process of test construction and analysis 
should recognise the need to make ideological and social choices. In particular it is argued that 
the use of item response models to attempt to resolve problems of test bias is both inappropriate
and misleading. 

7 POSTSCRIPT



This paper was submitted for publication in the journal 'Educational Measurement: Issues and 
Practice'. It was sent to three referees, all of whom recommended outright rejection! Their
detailed responses neatly illustrate many of the arguments of the paper itself, and are worth a
brief summary. 

All three referees claim that there is nothing original in the paper; a somewhat curious reason 
for rejection. If a critique is appropriate, then the fact that it draws upon earlier arguments hardly 
seems germane. More importantly, the general attitude of the referees is to deny that any
fundamental problem exists.

Thus, one referee claims that 'good' tests show no prediction bias against minorities and
furthermore that tests which reduce Black-White differences do not predict well. She or he in effect
is unwilling to recognise that there is a problem and concludes with the somewhat condescending
tautology that 'disadvantaged people really are disadvantaged'.

Another referee claims never to have been aware that the cultural expectations of test developers 
have resulted in detriment to any group. This of course precisely illustrates my point. The cultural 
conditioning of a person does not easily allow that person to observe the effect that the culture
itself is exerting. Somewhat revealingly, this referee takes me to task for stating that item response
models are mathematically complicated. 'Actually', she or he says, 'its very simplicity ... is one
of its claims to elegance', and goes on to argue that such models 'may best serve where
disagreement exists with respect to qualitative criteria for making decisions regarding bias'. I
think it would not be too much of a distortion to summarise this stance as 'the model is too
elegant not to be true'. Needless to say, one person's elegance may be another person's
oversimplification.

Finally, this same referee echoes what I believe is a typical response of the testing profession. 
He or she refers to good test development practice lying in the adherence to 'test specifications,
a blueprint prepared by subject matter experts who are knowledgeable about the purpose of the 
test and the characteristics of the intended test-taking population.' This effectively closes the 
circle around the profession. The real issue, however, remains: namely, how long the profession
can continue to ignore the claims of those who do not wish to share all, or even some, of its 
basic assumptions and beliefs. 